!pip install imageio
!pip install scikit-image
# Standard library
import os
import glob

# Numerics, data handling, and plotting
import numpy as np
import pandas as pd
import scipy.ndimage
from scipy import misc
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Image I/O
import skimage
import imageio
from PIL import Image

# Deep learning
import torch
from torch import nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

# Classical ML utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.preprocessing import LabelEncoder, StandardScaler
%matplotlib inline
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
device(type='cpu')
The HAM10000 ("Human Against Machine with 10000 training images") dataset contains 10,015 dermatoscopic images. It was made publicly available through the Harvard Dataverse in June 2018 to provide training data for automating the classification of skin lesions. The motivation was to give the public a large and varied data source for machine learning, so that the results can be compared with those of human experts. If successful, such applications could save hospitals and medical professionals both cost and time.
Apart from the 10,015 images, a metadata file with demographic information for each lesion is provided as well. More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the remaining cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal).
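To check how the ground truth is distributed across confirmation methods, one can tabulate the share of each `dx_type` value. The sketch below uses only the standard library and a small hypothetical list in place of the real column, so the numbers it prints are not the actual HAM10000 proportions.

```python
from collections import Counter

# Hypothetical toy stand-in for the `dx_type` column (NOT the real HAM10000 values).
dx_type = ["histo", "histo", "histo", "follow_up", "consensus", "confocal"]

# Count each confirmation method and convert the counts to fractions.
counts = Counter(dx_type)
shares = {kind: n / len(dx_type) for kind, n in counts.items()}

# Print the methods from most to least frequent.
for kind, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{kind:<10} {share:.1%}")
```

With the metadata loaded into a pandas DataFrame, `metadata['dx_type'].value_counts(normalize=True)` should give the same kind of summary in one line.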
You can download the dataset here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
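The 7 classes of skin lesions included in this dataset are:
- akiec: actinic keratoses
- bcc: basal cell carcinoma
- bkl: benign keratosis-like lesions
- df: dermatofibroma
- mel: melanoma
- nv: melanocytic nevi
- vasc: vascular lesions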
# importing metadata and checking for its shape
data_dir = "./data/HAM10000"
metadata = pd.read_csv(data_dir + '/HAM10000_metadata.csv')
print(metadata.shape)
# label encoding the seven classes for skin cancers
le = LabelEncoder()
le.fit(metadata['dx'])
LabelEncoder()
print("Classes:", list(le.classes_))
metadata['label'] = le.transform(metadata["dx"])
metadata.sample(10)
(10015, 8)
Classes: ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
| | lesion_id | image_id | dx | dx_type | age | sex | localization | dataset | label |
|---|---|---|---|---|---|---|---|---|---|
| 5931 | HAM_0002570 | ISIC_0029952 | nv | follow_up | 30.0 | male | trunk | vidir_molemax | 5 |
| 4314 | HAM_0004948 | ISIC_0025736 | nv | follow_up | 50.0 | female | lower extremity | vidir_molemax | 5 |
| 2119 | HAM_0001069 | ISIC_0028898 | mel | histo | 60.0 | female | lower extremity | rosendahl | 4 |
| 558 | HAM_0007598 | ISIC_0026514 | bkl | histo | 65.0 | male | lower extremity | rosendahl | 2 |
| 6040 | HAM_0002288 | ISIC_0027803 | nv | follow_up | 70.0 | male | abdomen | vidir_molemax | 5 |
| 3694 | HAM_0005214 | ISIC_0031114 | nv | follow_up | 75.0 | female | lower extremity | vidir_molemax | 5 |
| 4480 | HAM_0005430 | ISIC_0024990 | nv | follow_up | 60.0 | female | trunk | vidir_molemax | 5 |
| 8809 | HAM_0005017 | ISIC_0025310 | nv | histo | 75.0 | female | back | rosendahl | 5 |
| 9734 | HAM_0007085 | ISIC_0030076 | akiec | histo | 85.0 | female | upper extremity | rosendahl | 0 |
| 791 | HAM_0003574 | ISIC_0024877 | bkl | confocal | 50.0 | male | face | vidir_modern | 2 |
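The integer codes in the `label` column follow from the fact that `LabelEncoder` numbers the classes in sorted order of their names. The following sketch reproduces that mapping in plain Python on a hypothetical handful of `dx` values, and builds the reverse lookup that is useful when turning model predictions back into class names. The variable names are illustrative only.

```python
# Hypothetical sample of `dx` values; the real column has 10,015 entries.
dx_values = ["nv", "mel", "bkl", "akiec", "nv", "bcc", "df", "vasc"]

# Sorted unique class names, mirroring the order of LabelEncoder.classes_.
class_names = sorted(set(dx_values))

# Forward mapping (class name -> integer) and reverse mapping (integer -> class name).
name_to_label = {name: idx for idx, name in enumerate(class_names)}
label_to_name = {idx: name for name, idx in name_to_label.items()}

encoded = [name_to_label[name] for name in dx_values]
print(encoded)
print([label_to_name[idx] for idx in encoded])
```

With the fitted encoder from the notebook, `le.inverse_transform(...)` performs the reverse lookup directly.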
# Getting a sense of what the distribution of each column looks like
fig = plt.figure(figsize=(40,25))
ax1 = fig.add_subplot(221)
metadata['dx'].value_counts().plot(kind='bar', ax=ax1)
ax1.set_ylabel('Count', size=50)
ax1.set_title('Cell Type', size = 50)
ax2 = fig.add_subplot(222)
metadata['sex'].value_counts().plot(kind='bar', ax=ax2)
ax2.set_ylabel('Count', size=50)
ax2.set_title('Sex', size=50);
ax3 = fig.add_subplot(223)
metadata['localization'].value_counts().plot(kind='bar', ax=ax3)
ax3.set_ylabel('Count', size=50)
ax3.set_title('Localization', size=50)
ax4 = fig.add_subplot(224)
sample_age = metadata[pd.notnull(metadata['age'])]
# sns.distplot is deprecated; use histplot (seaborn >= 0.11) and overlay a fitted normal curve
sns.histplot(sample_age['age'], stat='density', kde=True, color='red', ax=ax4)
mu, sigma = stats.norm.fit(sample_age['age'])
x = np.linspace(sample_age['age'].min(), sample_age['age'].max(), 200)
ax4.plot(x, stats.norm.pdf(x, mu, sigma), color='black')
ax4.set_title('Age', size=50)
ax4.set_xlabel('Age (years)', size=50)
plt.tight_layout()
plt.show()
import os
import shutil
# Path to the folder that holds all the images:
data_dir = os.path.join(os.getcwd(), "data", "HAM10000")
image_dir = os.path.join(data_dir, "Images")
# Path to the folder where the rearranged images will be stored:
dest_dir = os.path.join(os.getcwd(), "data", "HAM10K")
# Read the metadata file:
metadata = pd.read_csv(os.path.join(data_dir, "HAM10000_metadata.csv"))
label = ['bkl', 'nv', 'df', 'mel', 'vasc', 'bcc', 'akiec']
# Create one sub-folder per class. exist_ok=True keeps the cell safe to re-run;
# a bare os.mkdir raises FileExistsError when the folder already exists.
for i in label:
    os.makedirs(os.path.join(dest_dir, i), exist_ok=True)
# Copy the images into the new folder structure:
for i in label:
    image_ids = metadata[metadata['dx'] == i]['image_id']
    for image_id in image_ids:
        shutil.copyfile(os.path.join(image_dir, image_id + ".jpg"),
                        os.path.join(dest_dir, i, image_id + ".jpg"))
As the bar chart of lesion types shows, the number of images per class is imbalanced: there are many more images of melanocytic nevi (nv) than of any other lesion type. This is a common occurrence in medical datasets, so it is important to analyze the data before training.
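One common way to compensate for such an imbalance is to weight each class inversely to its frequency. The sketch below is not taken from this notebook; it applies the widely used "balanced" heuristic, weight = n_samples / (n_classes * class_count), to a hypothetical toy label list, so the counts are illustrative rather than the real HAM10000 counts.

```python
from collections import Counter

# Hypothetical toy labels with one dominant class (NOT the real dataset counts).
labels = ["nv"] * 6 + ["mel"] * 2 + ["bcc"] * 1 + ["df"] * 1

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" heuristic: rare classes get weights above 1, frequent ones below 1.
class_weights = {name: n_samples / (n_classes * n) for name, n in counts.items()}

for name, weight in sorted(class_weights.items()):
    print(f"{name:<4} count={counts[name]:<2} weight={weight:.3f}")
```

Weights of this kind could, for example, be passed as the `weight` tensor of `nn.CrossEntropyLoss`, ordered by the integer class labels. Whether weighting, resampling, or augmentation works best depends on the model and should be checked empirically.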
# Visualizing the images
label = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
label_images = []
classes = ['actinic keratoses', 'basal cell carcinoma', 'benign keratosis-like lesions',
           'dermatofibroma', 'melanoma', 'melanocytic nevi', 'vascular lesions']
fig = plt.figure(figsize=(55, 55))
# Collect the first five image IDs of each class, in the order given by `label`
for i in label:
    sample = metadata[metadata['dx'] == i]['image_id'][:5]
    label_images.extend(sample)
# Draw a 7x5 grid: one row per class, five images per row
for position, ID in enumerate(label_images):
    labl = metadata[metadata['image_id'] == ID]['dx']
    im_sample = dest_dir + "/" + labl.values[0] + f'/{ID}.jpg'
    im_sample = imageio.imread(im_sample)
    plt.subplot(7, 5, position + 1)
    plt.imshow(im_sample)
    plt.axis('off')
    # Title only the first image of each row with the class name
    if position % 5 == 0:
        title = position // 5
        plt.title(classes[title], loc='left', size=50, weight="bold")
plt.tight_layout()
plt.show()